Depression is an acute problem throughout the world. Many people die every year as a result of
severe and prolonged depression. The problem is that most people are not aware that they are
suffering from depression. In this research, our aim was to determine whether an individual is
depressed by analyzing social media statuses. We therefore focused on real data. Our dataset
consists of 2,000 sentences collected from three social media platforms: Facebook, Twitter, and
Instagram. We then performed five data pre-processing steps for natural language processing (NLP):
tokenization, removal of stop words, removal of empty strings, removal of punctuation, and
stemming and lemmatization. The processed data served as the input to our selected models.
Finally, we applied six machine learning (ML) classifiers, namely multinomial Naive Bayes (NB),
logistic regression, linear support vector classifier, random forest, K-nearest neighbour, and
decision tree, to achieve better accuracy on our dataset. Among the six algorithms, multinomial NB
and logistic regression performed best on our dataset, reaching up to 98% accuracy.
1. INTRODUCTION

Depression is a common and serious medical illness that negatively affects how a person feels and
thinks. It is a key cause of disability and of suicide attempts worldwide. Anxiety and other
psychological disorders are associated with depression. Depression affects our social life and
controls our feelings and behavior. Around 80,000 people have died from this acute disorder [1]. In
addition, Asian people suffer more from this disease. It is also a concern that many countries do
not consider depression a disease, so it remains under-diagnosed [2]. People are losing their
ability to do their daily work due to depression, and this pushes them towards more severe
depression. The biggest problem is that people who are depressed do not know that they are
suffering from it. As a result, they gradually suffer mental and physical harm. Research also
suggests that diagnosis is essential if the following symptoms appear even once a day for 2 weeks
[3], [4]: i) feelings of incompetence and worthlessness, ii) decreased daily activity, iii) changes
in sleep patterns, iv) suicidal thoughts, v) reckless behavior or excessive anger and irritability,
and vi) changes in appetite.

People now express their thoughts and feelings on social media through text, pictures, and videos.
Text has become a widely used and popular medium of communication. Machine learning classifiers, in
turn, are useful for extracting information from data. Machine learning algorithms fall into three
main categories: supervised, unsupervised, and reinforcement learning. To identify depression from
text data, we applied different supervised learning techniques. In one study, support vector
regression (SVR) and multinomial Naive Bayes (NB) were applied and their accuracy was compared with
that of other authors. SVR performed better than multinomial NB, with an accuracy of 79.7% [5].

In another study, a dataset was created manually and 7,500 sentences were collected for the corpus.
NB and topical approach classifiers were applied to predict depression. The topical approach
performed better than NB, with an accuracy of 90% [6]. The Twitter application programming
interface (API) is one of the major sources for data collection. Recurrent neural networks (RNN)
have also outperformed most well-known classifiers. In another study, RNN, NB, and support vector
machine (SVM) were applied, and the RNN (90.3% accuracy) performed better than the others [7].
Almouzini et al. [8] also worked with Twitter data. They focused on western Asian countries and
collected 7,000 tweets from 97 Twitter users. Among the four classifiers they tested, LIBLINEAR
worked best on their dataset, with an accuracy of 87.5%.

Recently, a study proposed six classifiers to identify the saint and common forms in Bengali text.
Of the six classifiers, NB achieved much better accuracy than the others. Furthermore, count
vectorization, word tokenization, removal of stop words, and part-of-speech (POS) tagging were the
major data preprocessing steps [9]-[11]. Different libraries and tools, such as the natural
language toolkit (NLTK), TextBlob, the Waikato environment for knowledge analysis (WEKA), and
Beautiful Soup, have been used for data preprocessing [12], [13].

People with severe depression may have a negative outlook on life and may attempt suicide.
Coppersmith et al. [14] proposed a suicide risk prediction model based on natural language
processing and deep learning, with data collected from social media. In total, 418 users who had
attempted suicide were observed. Caicedo et al. [15] reported results on suicide attempts. They
applied supervised classifiers to predict suicidal cases from messages. They worked on 3,472 data
items, applied several classifiers, and initially considered four categories. Depression prediction
from speech is also a growing research area. He and Cao [16] proposed an approach to predict
depression using a deep convolutional neural network (DCNN) model. They used the audio-visual
emotion recognition challenge 2013 (AVEC2013) and AVEC2014 datasets and obtained 91% accuracy.
Priya et al. [17] predicted anxiety, depression, and stress. For this purpose, they collected data
through a Google form from a total of 348 participants. They chose five classifiers, among which NB
performed best and obtained 85% accuracy.

Zaghouani [18] considered 3,200 Twitter data items for predicting youth depression. The study
focused on the Middle East and North Africa (MENA) region to collect data and analyze depression,
using natural language processing (NLP) tools along with machine learning algorithms [18]. Sharma
et al. [19] used machine learning with natural language processing to predict suicidal issues among
youth. They considered Twitter data from kaggle.com and obtained 88% accuracy with a term
frequency-inverse document frequency (TF-IDF) approach. The performance of a classifier varies
across datasets. Nigam et al. [20] proposed a machine learning based approach to sentiment analysis
in which Sentiment140 (Stanford University) was used as the dataset. Different machine learning
algorithms were applied, and logistic regression performed best, with an accuracy of 82.59%. In a
few cases, SVM performed much better than other classifiers. Laoh et al. [21] worked on hotel
reviews to determine whether a review is positive or negative. SVM performed better, with an
accuracy of 94%. On the same dataset, they applied a recursive neural tensor network (RNTN)
classifier, but its accuracy was 85%. Hassan et al. [22] analyzed sentence-level sentiment to
measure depression and compared SVM, NB, and maximum entropy (ME). Data was collected from Twitter
and 20 newsgroups. In this study, SVM, NB, and ME achieved accuracies of 91%, 83%, and 80%,
respectively [23]. Twitter data has also been used to predict personality, according to Garg and
Garg [24]. In this study, the authors used six machine learning classifiers and worked on the data
of 17,464 users. Four personality attitudes were considered to predict an individual's personality.

Patidar and Umre [25] proposed an approach to predict depression and analyze its causes. For this
study, they collected a total of 1,695 data items, of which 535 were stressed and 1,160 were
non-stressed. They applied the naive Bayes algorithm to classify the depression level. Tadesse et
al. [26] studied the Reddit social media forum to predict depression, using natural language
processing along with machine learning algorithms. The multilayer perceptron (MLP) classifier
achieved 91% accuracy. They considered a total of 1,841 posts, of which 1,293 were
depression-related and 548 were standard posts. Rustam et al. [27] considered 7,528 tweets. They
measured sentiment and applied different machine learning algorithms: random forest (RF), XGBoost,
support vector classifier (SVC), extra tree classifier (ETC), and decision tree. Among these, ETC
provided the best accuracy, 92%.
To complete our study and obtain our desired outcome, shown in Figure 1, we applied supervised
machine learning techniques. These techniques are easy to build and provide better results. Our
dataset is labeled with two tags: depressed and non-depressed. Depression is currently a major
problem, and the young generation is especially vulnerable. Our aim is to propose a model to
predict depression, with the following objectives: i) to create a potential dataset for natural
language processing, ii) to examine standard algorithms for predicting depression that can be used
in further research, and iii) to evaluate model performance against the proposed model and previous
studies. The paper is organized as follows. Section 2 discusses the proposed method, in which data
collection and data pre-processing are the main parts. Section 3 compares and examines the results
and presents the better machine learning technique for predicting depression. Section 4 discusses
our future work, along with the conclusion and limitations of our study.
Figure 1. Structural design of our proposed model


2. PROPOSED RESEARCH METHOD 

To obtain our desired outcome, we applied a supervised learning method to predict depression from
text data. The working procedure is described step by step in the following sections. Figure 2
presents the workflow of our proposed work.
Figure 2. Flow chart of our proposed model
2.1. Data collection and analysis


Data collection was one of our major challenges. We collected data manually from different social
media sites and entered it into an Excel file. Our dataset consists of 2,000 sentences in total, of
which 1,025 are tagged as non-depressed and 975 as depressed. Table 1 shows the distribution of
data sources and the number of sentences we collected; all sample data came from three social media
sites. We used a count plot to visualize the data, as shown in Figure 3. We considered users'
comments and social statuses to create a potential dataset. We identified the desired data after
studying depression detection texts and other related works. Different texts have different
meanings. Table 2 presents sample data.


Table 1. Description of data sources
Social media sites    Number of users    Number of sentences
Facebook              100                1,050
Instagram             70                 450
Twitter               40                 500
Total                 210                2,000
Figure 3. Data visualization


Table 2. Sample data
Sentence    Label
My heart keeps breaking.    Depressed
Life is awesome, overlook your difficulties, be cheerful, live your life, Feel so happy, Nice to see you, It is surprising    Non-Depressed
It is my bad luck, Do not upset, Everything will be fine, Life is just a drama, Need a tour, anyone want to join with me.    Depressed
Life is short, be happy, live your life, Enjoy your day, Think about yourself,    Non-Depressed

2.2. Data pre-processing 
We applied five data pre-processing techniques to our dataset. Text pre-processing helps to
increase the quality of the dataset as well as the accuracy of the classifiers. We performed the
following pre-processing techniques, shown in Figure 4.

- Data normalization: after importing our dataset, we checked for null inputs. We found two null
inputs and dropped them. All punctuation was removed from our dataset, as it has no impact on our
research work. The punctuation marks that we removed are listed in Table 3.

- Data tokenization: tokenization is the process of splitting a string into tokens [9]. Words are
the tokens of a sentence, and sentences are the tokens of a paragraph. We applied sentence-to-word
tokenization to our dataset. Table 4 gives an example of tokenization.

- Removal of stop words: the most frequently used words are known as stop words. Stop words carry
no significance for identifying the depression level of a sentence [10]. Therefore, we removed
them from our dataset to reduce unnecessary load. We built our own stop-word list, choosing words
that do not change the meaning of a sentence. Some examples are given in Table 5.
Figure 4. Data pre-processing approaches


Table 3. Punctuation list
List of punctuations
' | # $ % - : > [
[remaining symbols illegible in the source]


Table 4. Example of tokenization
Before tokenization                       After tokenization
My heart keeps breaking.                  [my, heart, keeps, breaking]
Life is short, be happy, feel so happy    [life, is, short, be, happy, feel, so, happy]
Everything will be fine                   [Everything, will, be, fine]
That's how I feel now.                    [That's, how, I, feel, now.]


Table 5. List of stop words
Our customized list of stop words
'a', 'about', 'above', 'across', 'after', 'again', 'all', 'almost', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst',
'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became',
'because', 'become', 'becomes', 'becoming', 'been', 'before', 'being', 'below', 'beside', 'besides', 'between', 'bill', 'both', 'but', 'by', 'call',
'can', 'co', 'con', 'de', 'describe', 'detail', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'etc', 'even', 'ever', 'every', 'everyone',
'everything', 'everywhere', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him',


- Empty string removal: empty strings can cause problems when implementing the classifiers. They
waste memory and may lower a classifier's accuracy. Therefore, we removed all empty strings from
our dataset. Table 6 presents an example of empty string removal.

- Lemmatization: lemmatization is the process of removing inflectional endings from a word and
returning its base or dictionary form. It also helps to increase the accuracy of an algorithm.
Table 7 presents the results of the lemmatization technique.

- Data split: after applying all of the pre-processing techniques above, we created the final
dataset for depression prediction. Data splitting is also a major part of machine learning for
obtaining more reliable results. We used 80% of the data for training and 20% to test the model's
performance.
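
The pre-processing steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the paper does not name the library used for each step, so the stop-word set is a small excerpt from Table 5, the lemma lookup table is a hypothetical stand-in built from Table 7, and lowercasing is an added assumption.

```python
import string

# Assumption: a small excerpt of the customized stop-word list in Table 5.
STOP_WORDS = {"a", "about", "all", "am", "an", "and", "are", "be", "but", "by", "can", "he", "her", "him"}

# Assumption: a toy lookup table built from Table 7; a real system would
# use a dictionary-based lemmatizer from an NLP library instead.
LEMMAS = {"drinking": "drink", "is": "be", "done": "do", "goes": "go"}

# Translation table that deletes ASCII punctuation only
# (typographic quotes and other non-ASCII marks are left untouched).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(sentence):
    """Apply the five steps in order and return the cleaned token list."""
    # 1. Normalization: treat a null input as empty, lowercase, strip punctuation.
    text = (sentence or "").lower().translate(_PUNCT_TABLE)
    # 2. Tokenization: split on single spaces, which can leave empty
    #    strings behind when spaces repeat.
    tokens = text.split(" ")
    # 3. Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Empty-string removal (also drops whitespace-only tokens).
    tokens = [t for t in tokens if t.strip()]
    # 5. Lemmatization by lookup; unknown words pass through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("He is  drinking, and all goes well!"))
```

Splitting on single spaces is a deliberate choice here so that the empty-string step has something to remove; splitting on arbitrary whitespace would make that step unnecessary.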


Table 6. Removing empty string
Before removing empty string                           After removing empty string
["You", "", "are", " ", "looking", " ", "gorgeous"]    ["You", "are", "looking", "gorgeous"]
Table 7. Example of lemmatization
Before      After
Drinking    Drink
Is          Be
Done        Do
Goes        Go

2.3. Proposed model 

We propose a model that uses machine learning techniques to analyze human depression through text
analysis. Valence aware dictionary and sentiment reasoner (VADER) is a model for analyzing the
sentiment of text. We implemented VADER to analyze the polarity of each sentence, but its accuracy
was only 85%. We then applied the following algorithms to obtain the desired output: multinomial
naive Bayes (MNB), linear SVC, K-nearest neighbour (KNN), random forest (RF), decision tree (DT),
and logistic regression (LR).
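
To make the classification step concrete, the following is a minimal from-scratch sketch of a multinomial NB text classifier with an 80/20 split. It is not the authors' implementation, which would more plausibly rely on a machine learning library. The class and function names, the add-one smoothing, and the eight toy sentences are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def split_rows(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of rows and hold out test_ratio of them (80/20 by default)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]


class MultinomialNB:
    """Multinomial Naive Bayes over token counts with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        self.n_docs = len(labels)
        return self

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values()) + self.alpha * len(self.vocab)
        score = math.log(self.doc_counts[label] / self.n_docs)  # log prior
        for t in tokens:
            if t in self.vocab:  # tokens never seen in training are ignored
                score += math.log((counts[t] + self.alpha) / total)
        return score

    def predict(self, tokens):
        """Return the label with the highest log posterior score."""
        return max(self.doc_counts, key=lambda lab: self._log_score(tokens, lab))


# Toy, hand-made token lists (not the paper's dataset).
toy = [
    (["heart", "keeps", "breaking"], "depressed"),
    (["my", "bad", "luck"], "depressed"),
    (["life", "just", "drama"], "depressed"),
    (["feel", "so", "alone"], "depressed"),
    (["life", "awesome"], "non-depressed"),
    (["feel", "so", "happy"], "non-depressed"),
    (["enjoy", "your", "day"], "non-depressed"),
    (["live", "your", "life"], "non-depressed"),
]

train, test = split_rows(toy)
model = MultinomialNB().fit([t for t, _ in train], [y for _, y in train])
accuracy = sum(model.predict(t) == y for t, y in test) / len(test)
print(f"held-out accuracy on {len(test)} toy sentences: {accuracy:.2f}")
```

With only eight sentences, the held-out accuracy printed here says nothing about real performance; the point is the flow from token counts to log scores to a predicted label.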


3. RESULTS AND DISCUSSION 

To achieve our desired goal, we carried out all experiments in a Python environment. Python has
many of the libraries needed to implement a complete system. The following are also required to
develop our proposed system: OS (Windows/Linux/Mac), RAM (minimum 4 GB), and HDD (minimum
128 GB).

We conducted all experiments with the six algorithms mentioned in the previous section. Among them,
MNB and LR provided the best results. NB classifiers are a collection of algorithms based on Bayes'
theorem that share a common principle [24], namely that each feature is assumed to be conditionally
independent of the others given the class. The MNB classification report shows that the output was
predicted with 98% accuracy.

Performance metrics describe an algorithm more precisely. The F1-score balances recall and
precision. Precision is the ratio of correctly predicted positive values to the total number of
values predicted as positive. Recall is also crucial for defining the performance of an algorithm.
It is the number of true positives divided by the total number of true positives and false
negatives [28]. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives,
false positives, and false negatives, the performance metrics are defined as follows:


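$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$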
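
As a worked illustration of these four definitions, the short function below computes them from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1-score, and accuracy from confusion counts."""
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for a 200-sentence test set (not from the paper).
m = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```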


We calculated all of these metrics for the NB algorithm. Table 8 shows the best performance results
of the MNB classifier.

We first implemented VADER to obtain an initial accuracy, which was 85%. We then applied the six
classifiers mentioned above to our dataset to get more accurate results. Table 9 presents the
performance of our selected algorithms on four metrics.


Table 8. Performance of MNB on our dataset
                 Precision    Recall    F1-score    Support
Depressed        0.96         0.99      0.98        195
Non-Depressed    0.99         0.97      0.98        205
accuracy                                0.98        400
macro avg        0.98         0.98      0.98        400
weighted avg     0.98         0.98      0.98        400


Table 9. Comparison of the proposed classifiers
Classifier        Precision    Recall    F1-score    Accuracy
Multinomial NB    0.98         0.98      0.98        98%
LR                0.97         0.97      0.97        97%
Linear SVC        0.96         0.97      0.96        96%
RF                0.93         0.92      0.93        93%
KNN               0.95         0.84      0.89        91%
DT                0.92         0.86      0.89        90%
To illustrate the performance of a model, we used the receiver operating characteristic (ROC)
curve. The ROC curve is a graph that shows classification performance across threshold values; it
is plotted using two parameters, the true positive rate and the false positive rate. Algorithms can
also be compared through this graph [29]. Figure 5 shows the ROC curves of our proposed algorithms.
Figure 5. ROC curve for four selected algorithms
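
How the points of an ROC curve arise from a threshold sweep can be sketched as follows. The labels and scores are invented for illustration, and the function names are hypothetical; this is not the procedure used to produce Figure 5.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over scores.

    y_true holds 1 for the positive class and 0 for the negative class, and
    must contain both classes. A sample is predicted positive when its score
    is greater than or equal to the threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Invented labels (1 = depressed) and classifier scores for six sentences.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(f"AUC = {auc(points):.3f}")
```

A curve that rises quickly towards the top-left corner, and therefore has an area close to 1, indicates a classifier that separates the two classes well across thresholds.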


4. CONCLUSION 

Our aim was to predict depression from social media posts. People in poor and least developed
countries in particular suffer from this disease. They often suffer from depression for a long
time, which reduces their ability to survive and slowly pushes them towards death. A proper
solution is needed to reduce the level of depression and its impact. Our model makes it easier to
identify depression. It was evaluated on 2,000 sample texts and achieved 98% accuracy in predicting
depression, with MNB and LR providing the best results. However, the amount of data is crucial for
producing an acceptable result.

There are still many ways to enhance our study. Releasing this work as a mobile application is a
promising next step. Such an application would let people check their depression level at any time
and take the necessary steps. A notification system could be used to raise awareness through
social media. Moreover, we plan to identify how a patient can recover from depression and to work
on a larger dataset to increase the accuracy as much as possible. This will require creating a more
appropriate dataset.